39 research outputs found

SRS: A FRAMEWORK FOR DEVELOPING MALLEABLE AND MIGRATABLE PARALLEL APPLICATIONS FOR DISTRIBUTED SYSTEMS

Author: Arabe J. N. C.
Foster I.
JACK J. DONGARRA
Koo R.
SATHISH S. VADHIYAR
Tannenbaum T.
Publication venue: 'World Scientific Pub Co Pte Lt'

Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs

Author: Cirne W.
Kale L. V.
Kumbhar P.
Lopez V.
Lucero A.
Ludwig Walter
M. Cera
Vadhiyar Sathish S.
Yoo A. B.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019

In job scheduling, the concept of malleability has been explored for many years. Research shows that malleability improves system performance, but its use in HPC never became widespread. The causes are the difficulty of developing malleable applications and the lack of support and integration across the different layers of the HPC software stack. In recent years, however, malleability in job scheduling has become more critical because of the increasing complexity of hardware and workloads. In this context, using nodes in exclusive mode is not always the most efficient solution, as it was for traditional HPC jobs, where applications were highly tuned for static allocations but offered no flexibility for dynamic execution. This paper proposes a new holistic, dynamic job scheduling policy, Slowdown Driven (SD-Policy), which exploits the malleability of applications as the key technology to reduce the average slowdown and response time of jobs. SD-Policy is based on backfill and node sharing. It applies malleability to running jobs to make room for jobs that will run with a reduced set of resources, but only when the estimated slowdown improves over the static approach. We implemented SD-Policy in SLURM and evaluated it in a real production environment and with a simulator using workloads of up to 198K jobs. Results show better resource utilization, with reductions in makespan, response time, slowdown, and energy consumption of up to 7%, 50%, 70%, and 6%, respectively, for the evaluated workloads.
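The decision rule described in the abstract (start a job on reduced resources only when its estimated slowdown improves over the static approach) can be illustrated with a minimal sketch. This is not SD-Policy itself: the function names, the `shrink_factor` parameter, and the ideal runtime model (runtime grows as 1/fraction of resources) are assumptions made purely for illustration. Slowdown is taken in its usual sense of response time divided by run time.

```python
def slowdown(wait_time, run_time):
    """Usual slowdown metric: (wait + run) / run."""
    return (wait_time + run_time) / run_time


def prefer_malleable(static_wait, static_run, shrink_factor):
    """Return True if starting now on a reduced allocation is estimated
    to give a lower slowdown than waiting for the full allocation.

    shrink_factor is the fraction of requested resources available now
    (0 < shrink_factor <= 1). Hypothetical model: runtime scales as
    1 / shrink_factor. Both options are normalized by the static run
    time so that they are comparable.
    """
    malleable_run = static_run / shrink_factor
    static_sd = (static_wait + static_run) / static_run
    malleable_sd = malleable_run / static_run  # no waiting in this option
    return malleable_sd < static_sd


if __name__ == "__main__":
    # Long queue wait: running on half the resources now looks better.
    print(prefer_malleable(static_wait=3600, static_run=1800, shrink_factor=0.5))
    # Short queue wait: waiting for the full allocation looks better.
    print(prefer_malleable(static_wait=60, static_run=1800, shrink_factor=0.5))
```

A real scheduler would also have to account for the slowdown imposed on the running jobs that are shrunk to make room, which this sketch ignores.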

arXiv.org e-Print Archive

Crossref

UPCommons. Portal del coneixement obert de la UPC

GrADSolve—a grid-based RPC system for parallel computing with application-level scheduling

Author: Arbenz
Balay
Berman
Bershad
Birrell
Butler
Casanova
Chang
Denis
Denis
Denis
Foster
Geist
Jack J. Dongarra
Maassen
Petitet
René
Sathish S. Vadhiyar
Sato
Wolski
Publication venue: 'Elsevier BV'

Crossref

HyPar: A divide-and-conquer model for hybrid CPU-GPU graph processing

Author: Panja Rintu
Vadhiyar Sathish S
Publication venue: ACADEMIC PRESS INC ELSEVIER SCIENCE

Efficient processing of graph applications on heterogeneous CPU-GPU systems requires effectively harnessing the combined power of both the CPU and GPU devices. This paper presents HyPar, a divide-and-conquer model for processing graph applications on hybrid CPU-GPU systems. Our strategy partitions the given graph across the devices and performs simultaneous independent computations on both devices. The model provides a simple and generic API, supported by efficient runtime strategies for hybrid executions. The divide-and-conquer model is demonstrated with five graph applications. Experiments with these applications on a heterogeneous system show that our HyPar strategy provides performance equivalent to the state-of-the-art, optimized CPU-only and GPU-only implementations of the corresponding applications. When compared to the prevalent BSP approach for multi-device executions of graphs, our HyPar method yields 74%-92% average performance improvements.

Open Access Repository of IISc Research Publications

An Efficient MPI_Allgather for Grids

Author: Gupta Rakhi
Vadhiyar Sathish S
Publication venue: ACM Press

Allgather is an important MPI collective communication operation. Most algorithms for allgather have been designed for homogeneous and tightly coupled systems. The existing algorithms for allgather on Grid systems do not efficiently utilize the bandwidths available on the slow wide-area links of the grid. In this paper, we present an algorithm for allgather on grids that efficiently utilizes wide-area bandwidths and is also wide-area optimal. Our algorithm also adapts to grid load dynamics, since it considers transient network characteristics when dividing the nodes into clusters. Our experiments on a real grid setup consisting of 3 sites show that our algorithm gives an average performance improvement of 52% over existing strategies.
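For readers unfamiliar with the operation, the following sketch simulates what an allgather computes when nodes are grouped into clusters. It shows a generic three-phase hierarchical scheme (gather inside each cluster, exchange between clusters, broadcast inside each cluster), not the algorithm of the paper; the function name and the list-of-lists representation are assumptions for illustration only, and no real MPI calls are made.

```python
def hierarchical_allgather(clusters):
    """Simulate allgather over clustered processes.

    clusters is a list of clusters, each a list holding one value per
    process. Returns the same nesting, where every process ends up with
    the full list of values in cluster order.
    """
    # Phase 1: each cluster collects its members' values locally.
    local_blocks = [list(cluster) for cluster in clusters]

    # Phase 2: cluster blocks are exchanged over the wide-area links,
    # so each cluster block crosses the slow links as one unit.
    full = [value for block in local_blocks for value in block]

    # Phase 3: the complete result is distributed inside each cluster.
    return [[list(full) for _ in cluster] for cluster in clusters]


if __name__ == "__main__":
    result = hierarchical_allgather([[1, 2], [3], [4, 5]])
    print(result[1][0])
```

The point of grouping is that traffic on the slow wide-area links is organized per cluster rather than per process; how clusters are formed from transient network measurements is specific to the paper and is not modelled here.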

Open Access Repository of IISc Research Publications

ACCT: Automatic Collective Communications Tuning

Author: Fagg Graham E
Vadhiyar Sathish S
Publication venue: Springer Nature
Publication date: 01/01/2000

The University of Manchester - Institutional Repository

Strategies for Rescheduling Tightly-Coupled Parallel Applications in Multi-Cluster Grids

Author: Sanjay HA
Vadhiyar Sathish S
Publication venue: Springer

As computational Grids are increasingly used for executing long-running multi-phase parallel applications, it is important to develop efficient rescheduling frameworks that adapt application execution in response to resource and application dynamics. In this paper, three strategies or algorithms have been developed for deciding when and where to reschedule parallel applications that execute on multi-cluster Grids. The algorithms derive rescheduling plans that consist of potential points in application execution for rescheduling and schedules of resources for application execution between two consecutive rescheduling points. Using a large number of simulations, it is shown that the rescheduling plans developed by the algorithms can lead to a large decrease in application execution times when compared to executions without rescheduling on dynamic Grid resources. The rescheduling plans generated by the algorithms are also shown to be competitive when compared to the near-optimal plans generated by brute-force methods. Of the algorithms, the genetic algorithm yielded the most efficient rescheduling plans, with 9-12% smaller average execution times than the other algorithms.

Open Access Repository of IISc Research Publications